Abstract

We propose a simple linear-time on-line algorithm for constructing a position
heap for a string [EMOW11]. Our definition of position heap differs slightly
from the one proposed in [EMOW11] in that it considers the suffixes ordered in
the descending order of length. Our construction is based on classic suffix
pointers and resembles Ukkonen's algorithm for suffix trees [Ukk95]. Using
suffix pointers, the position heap can be extended into the augmented position
heap that allows for a linear-time string matching algorithm [EMOW11].

1 Introduction

The theory of string algorithms developed beautiful data structures for string
matching and text indexing. Among them, suffix tree and suffix array are the most
widely used structures, providing efficient solutions for a wide range of
applications [CR94, Gus97]. The DAWG (Directed Acyclic Word Graph) [BBH+85], also
known as suffix automaton [Cro86], is another elegant structure that can be used
both as a text index [BBH+85] or as a matching automaton [Cro88, CR94].

Recently, a new position heap data structure was proposed [EMOW11]. Similar
to the suffix tree, DAWG or suffix array, position heap allows for a pre-processing of a
text string in order to efficiently search for patterns in it. As for the above-mentioned
data structures, a position heap for a string of length n can be constructed in time
O(n). Then all locations of a pattern of length m can be found in time O(m + occ),
where occ is the number of occurrences.

The construction algorithm of [EMOW11] processes the string from right to left,
like Weiner's algorithm does for suffix trees [Wei73]. Moreover, the construction
requires a so-called dual heap, which is an additional trie on the same set of nodes.
The position heap and its dual heap are constructed simultaneously.

To obtain a linear-time pattern matching algorithm of [EMOW11], the position
heap should be post-processed in order to add some additional information, resulting
in the augmented position heap. The most important element of this information
includes so-called maximal-reach pointers assigned to certain nodes. Computing
these pointers makes use of the dual heap too.

In this paper, we propose a different construction of the position heap. First, 
we change the definition of the position heap by reversing the order of suffixes 
and thus allowing for the left-to-right traversal of the input string. The modified 
definition, however, preserves good properties of the position heap and does not 
affect the string matching algorithm proposed in [EMOW11]. For this modified
definition, we propose an on-line algorithm for constructing the position heap. Our
algorithm does not use the dual heap, replacing it by classic suffix pointers used
for constructing suffix trees by Ukkonen's algorithm [Ukk95] or for constructing the
DAWG [BBH+85]. Our algorithm is simple and can be compared to Ukkonen's
algorithm for suffix trees, as opposed to Weiner's algorithm that constructs the
suffix tree by inserting suffixes right-to-left (i.e. shortest first). We deliberately use
some terminology of Ukkonen's algorithm to underline this similarity.

We further show that the augmented position heap can be easily constructed us- 
ing suffix pointers. Thus, we completely eliminate the use of the dual heap, replacing 
it by suffix pointers for constructing both the position heap and its augmented ver- 
sion. Even if this replacement does not provide an immediate improvement in space 
or running time, we believe that our construction is conceptually simpler and more 
natural. 

Throughout the paper, we assume we are given a constant-size alphabet A.
Positions of strings over A are numbered from 1, that is, a string w of length k
is w[1] ... w[k]. The length k of w is denoted by |w|. w[i..j] denotes the substring
w[i] ... w[j].

A trie (term attributed to Fredkin [Fre60]) is a simple natural data structure
for storing a set of strings. It is a tree with edges labeled by alphabet letters, such 
that for any internal node, the edges leading to the children nodes are labeled by 
distinct letters. In this paper, we assume the edges to be directed towards leaves, 
and call an edge labeled by a letter a an a-edge. A label of a node (path label) is 
the string formed by the letters labeling the edges of the path from the root to this 
node. Given a trie, a string w is said to be represented in the trie if it is a path 
label of some node. The corresponding node will then be denoted by w. 

2 Definition of position heap 

To define position heaps, we first need to introduce the sequence hash tree proposed
by Coffman and Eve back in 1970 [CE70] as a data structure for implementing hash
tables. Assume we are given an ordered set of strings W = {w_1, ..., w_n} and assume
for now that no w_i is a prefix of w_j for any j < i. The sequence hash tree for W,
denoted SHT(W), is a trie defined by the following iterative construction. We start
with the tree SHT_0(W) consisting of a single root node root^1. We then construct
SHT(W) by processing strings w_1, ..., w_n in this order and for each w_i, adding one
node to the tree. By induction, assume that SHT_i(W) is the sequence hash tree
for {w_1, ..., w_i}. To construct SHT_{i+1}(W), we find the shortest prefix v of w_{i+1}
which is not represented in SHT_i(W). Note that by our assumption, such a prefix
always exists. Let v = v'a, a ∈ A, i.e. v' is the longest prefix of w_{i+1} represented in
SHT_i(W). Then SHT_{i+1}(W) is obtained from SHT_i(W) by adding a new node as
a child of v' connected to v' by an a-edge and pointing to w_{i+1}. After inserting all
strings of W, we obtain SHT(W), that is SHT(W) = SHT_n(W). Thus, SHT(W)
is a trie of n + 1 nodes such that a node pointing to w_i is labeled by some prefix
of w_i. Note that the size of the sequence hash tree depends only on the number
of strings in the set and does not depend on the length of those. An example of
sequence hash tree is given in Figure 1.
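The iterative construction above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `sequence_hash_tree` and the nested-dictionary trie are hypothetical choices, not part of [CE70] or [EMOW11], and the sketch returns the path label of the node created for each string rather than the tree itself.

```python
def sequence_hash_tree(strings):
    """Return {rank: path label of the node pointing to the string of that rank}.

    Hypothetical helper: the trie is a nested dict (letter -> child dict).
    Assumes, as in the text, that every string has a prefix not yet represented.
    """
    root = {}
    labels = {}
    for rank, w in enumerate(strings, start=1):
        node = root
        for depth, letter in enumerate(w, start=1):
            if letter not in node:
                # w[:depth] is the shortest prefix not represented: add one node.
                node[letter] = {}
                labels[rank] = w[:depth]
                break
            node = node[letter]
        else:
            # The whole string is already represented: the assumption is violated.
            raise ValueError(f"string of rank {rank} is already represented")
    return labels


figure1_strings = ["babbaab", "bbab", "ab", "baaa", "aabab", "babaaba"]
figure1_labels = sequence_hash_tree(figure1_strings)
```

On the six strings of Figure 1, the sketch creates exactly one node per string, so the tree has n + 1 = 7 nodes including the root.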

We now define the position heap of a string T. In [EMOW11], the position heap
for T is defined as the sequence hash tree for the set of suffixes of T, where the
suffixes are ordered in the ascending order of length, i.e. from right to left. This
ensures, in particular, the condition that no suffix is a prefix of a previously inserted
suffix, and then no suffix is already represented in the position heap at the time of
its insertion.

^1 This definition agrees with the definition of [CE70] but is slightly different from that of
[EMOW11], which defines the root to store w_1. The difference is insignificant, however.

1. babbaab
2. bbab
3. ab
4. baaa
5. aabab
6. babaaba

Figure 1: Sequence hash tree for the set of strings shown on the right. Each node
stores the rank of the corresponding string in the set.

In this paper, we define the position heap of T to be the sequence hash tree for 
the set of suffixes of T, where the suffixes are ordered in the descending order of 
length, i.e. from left to right. From now on, we stick to this order. An immediate 
observation is that the assumption of the sequence hash tree does not hold anymore,
and it may occur that an inserted suffix is already represented in the position heap 
by an existing node. One easy way to cope with this is to systematically assume 
that T is ended by a special sentinel symbol $, like it is generally assumed for the 
suffix tree. 

On the other hand, as we will be interested in an on-line construction of the 
position heap, we will still need to construct the position heap for strings without 
the ending sentinel symbol. For that, we have to slightly change the definition of 
sequence hash tree of a set W, by allowing one node to point to several strings of 
W. The definition of the position heap extends then to any string, with the only 
difference that inserting a suffix may no longer lead to the creation of a new node, 
but to adding a pointer to this suffix to an existing node. This feature, however, 
will be used in a very restricted way, as the following observation shows. 

Lemma 1 Let W be a set of distinct strings. Then every node of SHT(W) points
to at most two strings of W.

Proof: The only situation when a new pointer gets inserted to an existing node is
when the inserted string w_{i+1} is already represented in SHT_i(W). Since all strings
of W are distinct, this situation may occur only once for each node. Therefore, each
node of SHT(W) points to one or two strings of W. □

As a consequence of Lemma 1, a position heap contains two types of nodes,
pointing respectively to one and two suffixes of T. The former will be called regular 
nodes and the latter double nodes. We naturally assume that a pointer to a suffix 
is simply the starting position of that suffix, therefore regular and double nodes 
store one and two string positions respectively. Hereafter we interchangeably refer 
to "suffixes" and "positions" when the underlying string is unambiguously defined. 

Figure 2 provides an example of a position heap.
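To make the definition concrete, the following Python sketch builds the position heap directly from the definition, inserting suffixes from left to right and recording secondary positions. It is a naive, worst-case quadratic illustration and not the linear-time algorithm of this paper; the function name `position_heap_naive` and the dictionary-based output are our own assumptions.

```python
def position_heap_naive(text):
    """Build the position heap of `text` straight from the definition.

    Returns (primary, secondary): two dicts mapping a 1-based position to the
    path label of the node storing it. Hypothetical helper, quadratic time.
    """
    root = {}                      # nested dicts: letter -> child node
    primary, secondary = {}, {}
    for i in range(1, len(text) + 1):
        suffix = text[i - 1:]
        node = root
        for depth, letter in enumerate(suffix, start=1):
            if letter not in node:
                # Shortest non-represented prefix: create a regular node.
                node[letter] = {}
                primary[i] = suffix[:depth]
                break
            node = node[letter]
        else:
            # Whole suffix already represented: the node becomes a double node.
            secondary[i] = suffix
    return primary, secondary


primary, secondary = position_heap_naive("aababbbaabaab")
```

For the string aababbbaabaab of Figure 2, positions 12 and 13 end up as secondary positions of the nodes ab and b, which also store the primary positions 2 and 3.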

3 Properties of position heap 

Denote by PH(T) the position heap for a string T[1..n] as defined in the previous
section. In the following theorem, we summarize some key properties of the position
heap.

Theorem 1 ([EMOW11]) Consider PH(T[1..n]). The following properties hold.



 i    :  1  2  3  4  5  6  7  8  9 10 11 12 13
 T[i] :  a  a  b  a  b  b  b  a  a  b  a  a  b

Figure 2: Position heap for string aababbbaabaab. Double nodes store pairs of
positions.



(i) A substring w of T is represented in PH(T) iff T contains occurrences of
strings w[1..1], w[1..2], w[1..3], ..., w[1..|w|], appearing at increasing positions
in this order.

(ii) The labels of all nodes of PH(T) form a factorial set. That is, if a string is
represented in PH(T), all its substrings are represented too.

(iii) The depth of PH(T) is no more than 2h(T), where h(T) is the length of the
longest substring w of T which occurs |w| times in T (possibly with overlap).

(iv) If a substring w occurs in T at least |w| times, then w is represented in PH(T).
Conversely, if w is not represented in PH(T) and w' is the longest prefix of w
which is represented, then w cannot occur in T more than |w'| times.

Proof: (i) The 'if'-part follows immediately from the definition of PH(T) and the
left-to-right order of suffixes. Conversely, if a substring w is represented in PH(T),
then nodes w[1..1], w[1..2], w[1..3], ..., w[1..|w|] have been created in this
respective order. The creation of each such node w[1..ℓ] has been triggered by an
insertion of a suffix starting with w[1..ℓ]. Since suffixes are inserted from left to
right, property (i) follows.

Properties (ii)-(iv) have been established in [EMOW11] but remain valid for our
definition of position heap when suffixes are inserted from left to right. Actually,
these properties are valid for any order of inserting suffixes into the position heap.

(ii) It is sufficient to show that if some string w[1..ℓ] is represented in PH(T),
then both w[1..ℓ − 1] and w[2..ℓ] are represented too. For w[1..ℓ − 1], this is obvious
from construction. For w[2..ℓ], this can be seen from Property (i). Indeed, if strings
w[1..1], w[1..2], w[1..3], ..., w[1..ℓ] appear in T in this relative order, then we have
strings w[2..2], w[2..3], ..., w[2..ℓ] appearing in T at increasing positions too. By
Property (i), this ensures that w[2..ℓ] is represented in PH(T).

(iii) Let w be one of the deepest nodes of PH(T), i.e. the depth of PH(T) is
d = |w|. From Property (i), strings w[1..⌈d/2⌉], w[1..⌈d/2⌉ + 1], ..., w[1..d] occur
in T at distinct positions, and therefore w[1..⌈d/2⌉] occurs at least ⌈d/2⌉ times in
T. Then h(T) ≥ ⌈d/2⌉ and the depth d of PH(T) is bounded by 2h(T).

(iv) If a substring w occurs in T at least |w| times, then there exist successive
occurrences of w[1..1], w[1..2], w[1..3], ..., w[1..|w|], and, by Property (i), w is
represented in PH(T). Assume now that w is not represented in PH(T) and w' is
the longest prefix of w which is represented. Assume further that a is the letter
that follows prefix w' in w. Observe that w'a occurs at most |w'| times, as the
contrary would mean that w'a is represented too, which contradicts the choice of
w'. Therefore, w'a occurs at most |w'| times and so does w. □

Figure 3: Updating secondary position i when transforming PH(T[1..k]) into
PH(T[1..k + 1]): first case (left) and second case (right)

Properties (iii) and (iv) show that the position heap of a string "adapts" to the
frequencies of its substrings. In particular, if a string is "frequent" (occurs as many
times as it is long), then it is necessarily represented in the position heap. On the
other hand, if it is not represented, it has fewer occurrences than its length. The
latter property is crucial for obtaining a linear-time string matching algorithm of
[EMOW11].
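Properties (ii) and (iv) of Theorem 1 are easy to check by brute force on small inputs. The Python sketch below does so for a given string; it is an illustrative check only, and the helper names (`heap_labels`, `occurrences`, `is_factorial`, `frequent_substrings_represented`) are our own assumptions rather than notions from [EMOW11].

```python
def heap_labels(text):
    """Set of path labels of the position heap of `text` (root label is "")."""
    labels = {""}
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] not in labels:
                labels.add(text[i:j])   # shortest non-represented prefix
                break
    return labels


def occurrences(text, w):
    """Number of (possibly overlapping) occurrences of w in text."""
    return sum(text.startswith(w, i) for i in range(len(text) - len(w) + 1))


def is_factorial(labels):
    """Property (ii): every substring of a represented string is represented."""
    return all(w[i:j] in labels
               for w in labels
               for i in range(len(w))
               for j in range(i, len(w) + 1))


def frequent_substrings_represented(text, labels):
    """Property (iv): a substring w occurring at least |w| times is represented."""
    substrings = {text[i:j]
                  for i in range(len(text))
                  for j in range(i + 1, len(text) + 1)}
    return all(w in labels for w in substrings
               if occurrences(text, w) >= len(w))


example_labels = heap_labels("aababbbaabaab")
```

For instance, in aababbbaabaab the string aab occurs three times and is represented, whereas baa occurs only twice; it happens to be represented as well, which property (iv) allows but does not require.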

4 On-line construction algorithm 

Let us have a closer look at the properties of double nodes of a position heap PH{T). 
Each such node stores two positions i, j of T. Assume i < j, then positions i and j 
will be called the primary and the secondary positions respectively. 

Lemma 2 Let T = T[1..n]. If j < n is the secondary position of some node of
PH(T), then so is j + 1.

Proof: Consider PH(T) for some string T[1..n]. Assume i, j, i < j, are respectively
primary and secondary positions of some node. This means that by the time the
suffix T[j..n] is inserted into PH(T) during its construction, node T[j..n] already
exists. By Theorem 1(ii), node T[j + 1..n] exists too. A fortiori, node T[j + 1..n]
exists when T[j + 1..n] is inserted into PH(T). Therefore, j + 1 becomes the
secondary position of that node after the insertion of suffix T[j + 1..n]. □

Lemma 2 implies that all positions of T[1..n] are split into two intervals: primary
positions [1..s − 1], for some position s, and secondary positions [s..n]. Position s
will be called active secondary position, or active position for short.

Assume we have constructed the position heap PH(T[1..k]) for some prefix
T[1..k] of the input string T[1..n]. Let us analyze the differences between PH(T[1..k])
and PH(T[1..k + 1]) and the modifications that need to be made to transform the
former into the latter.

Let s be the active position of T[1..k]. First observe that for suffixes 1, ..., s − 1,
no changes need to be made. Inserting each suffix T[i..k] for 1 ≤ i ≤ s − 1 into
PH(T[1..k]) led to the creation of a new node. This means that by the time this
suffix was inserted into PH(T[1..k]), some prefix T[i..ℓ] of T[i..k], ℓ ≤ k, was not
represented in the position heap, which led to the creation of a new node T[i..ℓ] with
the minimal such ℓ. This shows that inserting suffixes 1, ..., s − 1 involves completely
identical steps in the construction of both PH(T[1..k]) and PH(T[1..k + 1]).

The situation is different for the secondary positions s, ..., k. Each suffix T[i..k]
for s ≤ i ≤ k was already represented in PH(T[1..k]) at the moment of its
insertion, and then resulted in the addition of the secondary position i to the node
T[i..k]. When inserting the corresponding suffix T[i..k + 1] into the position heap
PH(T[1..k + 1]), two cases arise. In the first case, inserting the suffix T[i..k + 1]
leads to the creation of the new node T[i..k + 1] if this node does not exist yet.
Position i then becomes the primary position of this new node. Observe that this only
occurs when PH(T[1..k]) does not contain a T[k + 1]-edge outgoing from the node
T[i..k]. It is easily seen that such an edge cannot appear by the time of insertion of
T[i..k + 1] into PH(T[1..k + 1]) if it was not already present in PH(T[1..k]). In the
second case, node T[i..k] has an outgoing T[k + 1]-edge in PH(T[1..k]), and in the
construction of PH(T[1..k + 1]), the secondary position i stored in this node should
be "moved" to the child node T[i..k + 1]. It becomes then the secondary position of
this node. The two cases are illustrated in Figure 3.

Observe now that if for a secondary position i, the corresponding node T[i..k] has
an outgoing T[k + 1]-edge, then so does the node T[i + 1..k] storing the secondary
position i + 1. This can again be seen from the factorial property of the position
heap (Theorem 1(ii)). This shows that the above two cases split the interval of
secondary positions [s..k] into two subintervals [s..t − 1] and [t..k], such that node
T[i..k] does not have an outgoing T[k + 1]-edge for i ∈ [s..t − 1] and does have such
an edge for i ∈ [t..k].

The above discussion is summarized in the following lemma specifying the changes 
that have to be made to transform PH{T[l..k]) into PH{T[l..k + 1]). 

Lemma 3 Given T[1..n], consider PH(T[1..k]) for k < n. Let s be the active
secondary position, stored in the node T[s..k]. Let t ≥ s be the smallest position
such that node T[t..k] has an outgoing T[k + 1]-edge. To obtain PH(T[1..k + 1]),
PH(T[1..k]) should be modified in the following way:

(i) for every node T[i..k], s ≤ i ≤ t − 1, create a new child linked to T[i..k] by a
T[k + 1]-edge. Delete secondary position i from the node T[i..k] and assign it
as a primary position to the new node T[i..k + 1],

(ii) for every node T[i..k], i ≥ t, move the secondary position i from node T[i..k]
to node T[i..k + 1].

We describe now the algorithm implementing the changes specified by Lemma 3.
We augment PH(T) with suffix pointers f defined in the usual way:

Definition 1 For each node T[i..j] of PH(T), a suffix pointer is defined by

f(T[i..j]) = T[i + 1..j].

Note that the definition is sound, as the node T[i + 1..j] exists whenever the node
T[i..j] exists, according to Theorem 1(ii). For the root node, it will be convenient
for us to define f(root) = ⊥, where ⊥ is a special node such that there is an a-edge
between ⊥ and root for every a ∈ A (similar to Ukkonen's algorithm [Ukk95]).
Figure 4 shows the position heap of Figure 2 supplemented by suffix pointers.

We now begin to describe the on-line construction algorithm for PH(T), given a
text T[1..n]. Consider the node T[s..k] of PH(T[1..k]) storing the active secondary
position s, that we call the active node. If the active secondary position does not
exist (i.e. there are no secondary positions at all), then the active node is root
and the active position is set to k + 1. Observe that the nodes storing the other
secondary positions s + 1, s + 2, ..., k can be reached, in order, by following the
chain of suffix pointers f(T[s..k]), f(f(T[s..k])), ... until the root node is reached.
On the example of Figure 4, the active secondary position is 12, and the chain of
suffix pointers outgoing from the active node leads to the node storing position 13
followed by the root.

This brings us to the main trick of our construction: we will not store secondary
positions at all, but only memorize the active secondary position and the active
node. The secondary positions can be easily recovered by traversing the chain of
suffix pointers starting from the active node and incrementing the position counter
after traversing each edge. Note also that if the input string T is ended by a unique
sentinel symbol, the resulting position heap does not contain any secondary positions
and there is no need to recover them.

Figure 4: Position heap for string aababbbaabaab with suffix pointers (dotted
arrows). Secondary positions are shown in italic.

Keeping in mind that the secondary positions are not stored explicitly, the
transformation of PH(T[1..k]) into PH(T[1..k + 1]) specified by Lemma 3 reduces to
processing case (i) only, as case (ii) does not imply any modification anymore. Case (i)
is implemented by the following simple procedure. Starting from the active node,
the algorithm traverses the chain of suffix pointers as long as the current node does
not have an outgoing T[k + 1]-edge. For each such node, a new node is created
linked by a T[k + 1]-edge to the current node. A suffix pointer to this new node
is set from the previously created new node. Once the first node with an outgoing
T[k + 1]-edge is encountered, the algorithm moves to the node this edge leads to,
sets the suffix pointer to this node, and assigns this node to be the active node
for the following iteration. The correctness of the last assignment is stated in the
following lemma.

Lemma 4 Consider PH(T[1..k]) and let s be the active position, and t ≥ s be the
smallest position such that node T[t..k] has an outgoing T[k + 1]-edge. Then node
T[t..k + 1] is the active node of PH(T[1..k + 1]).

Proof: As it follows from Lemma 3, positions s, ..., t − 1 become primary in
T[1..k + 1], and therefore t is the smallest secondary position of T[1..k + 1]. □

Algorithm 1 provides a pseudo-code of the algorithm.

The correctness of Algorithm 1 follows from Lemmas 3, 4 and the discussion
above. It is instructive, in addition, to observe the following:

• it is easily seen that the suffix pointers of PH(T[1..k + 1]) are correctly set.
Indeed, the algorithm assigns to T[i..k + 1] a suffix pointer to T[i + 1..k + 1]
which is obviously correct. Note that for the active position s of T[1..k], the
created node T[s..k + 1] does not get pointed to by any suffix pointer, which
is correct, as T[s − 1..k + 1] is not represented in PH(T[1..k + 1]): the position
s − 1 is primary in T[1..k] and therefore the node T[s − 1..k], if it exists in
PH(T[1..k]), does not get extended by a T[k + 1]-edge (cf. Lemma 2).

• since the depth of T[s..k] (s is the active position) in PH(T[1..k]) is k + 1 − s
and a traversal of a suffix link decrements the depth by 1 and increments
the current position by 1, it follows that if the traversal of the suffix chain
reaches the root node, the active position value becomes k + 1, which is exactly
what we need to start processing the next letter T[k + 1]. This shows why
Algorithm 1 correctly maintains currentsuffix and never needs to reset it at
the beginning of the for-loop iteration.

Algorithm 1 On-line construction of the position heap PH(T[1..n])
 1: create states root and ⊥
 2: f(root) ← ⊥
 3: for all a ∈ A do
 4:     set an a-edge from ⊥ to root
 5: end for
 6: currentnode ← root
 7: currentsuffix ← 1
 8: for i = 1 to n do
 9:     lastcreatednode ← undefined
10:     while currentnode does not have an outgoing T[i]-edge do
11:         create a new node newnode pointing to currentsuffix
12:         set a T[i]-edge from currentnode to newnode
13:         if lastcreatednode ≠ undefined then
14:             f(lastcreatednode) ← newnode
15:         end if
16:         lastcreatednode ← newnode
17:         currentnode ← f(currentnode)
18:         currentsuffix ← currentsuffix + 1
19:     end while
20:     move currentnode to the target node of the outgoing T[i]-edge
21:     if lastcreatednode ≠ undefined then
22:         f(lastcreatednode) ← currentnode
23:     end if
24: end for

It is easy to see that the running time of Algorithm 1 is linear in the length
n of the input string. Since each iteration of the while-loop creates a node and
every created node stores a distinct primary position, this loop iterates at most
n times over the whole run of the algorithm. Trivially, the
for-loop iterates n times too, and all the involved operations are constant time.
Thus, the whole algorithm takes O(n) time. The following theorem concludes the
construction.

Theorem 2 For an input string T[1..n], Algorithm 1 correctly constructs PH(T)
on-line in time O(n).
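As an illustration, Algorithm 1 can be transcribed into Python almost line by line. The sketch below is one possible rendering under our own assumptions: the class `Node`, the functions `build_position_heap` and `describe`, and the use of a dictionary for outgoing edges are hypothetical choices, and the alphabet is taken to be the set of letters occurring in the text. Secondary positions are not stored; `describe` recovers them from the active node by following suffix pointers, as explained above.

```python
class Node:
    def __init__(self, position=None):
        self.position = position   # primary position (None for root and bottom)
        self.children = {}         # letter -> child node
        self.suffix = None         # suffix pointer f


def build_position_heap(text):
    """Return (root, active_node, active_position) for PH(text)."""
    bottom, root = Node(), Node()
    root.suffix = bottom
    for a in set(text):                        # a-edge from bottom to root
        bottom.children[a] = root
    current_node, current_suffix = root, 1
    for letter in text:
        last_created = None
        while letter not in current_node.children:
            new_node = Node(current_suffix)
            current_node.children[letter] = new_node
            if last_created is not None:
                last_created.suffix = new_node
            last_created = new_node
            current_node = current_node.suffix
            current_suffix += 1
        current_node = current_node.children[letter]
        if last_created is not None:
            last_created.suffix = current_node
    return root, current_node, current_suffix


def describe(text, root, active_node, active_position):
    """Map primary and secondary positions to the path labels of their nodes."""
    primary, stack = {}, [(root, "")]
    while stack:
        node, label = stack.pop()
        if node.position is not None:
            primary[node.position] = label
        stack.extend((child, label + a) for a, child in node.children.items())
    secondary, node, pos = {}, active_node, active_position
    while node is not root:                    # walk the chain of suffix pointers
        secondary[pos] = text[pos - 1:]
        node, pos = node.suffix, pos + 1
    return primary, secondary


text = "aababbbaabaab"
root, active_node, active_position = build_position_heap(text)
primary, secondary = describe(text, root, active_node, active_position)
```

On aababbbaabaab the sketch ends with active position 12 and the active node ab, in agreement with the example of Figure 4.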

5 Augmented position heap 

Assume we have a text T[1..n] for which we constructed the position heap PH(T).
We do not assume that T is ended by a unique letter, and therefore some nodes
of PH(T) are double nodes and store two positions of T, one primary and one
secondary. Here we assume that the secondary positions are actually stored (or can
be retrieved in constant time for each node). As explained in Section 4, even if the
secondary positions are not stored during the construction of PH(T), they can be
easily recovered once the construction is completed.

[EMOW11] proposed a linear-time string matching algorithm using PH(T[1..n]),
i.e. an algorithm that computes all occurrences of a pattern string in T in time
O(m + occ), where m is the pattern length and occ the number of occurrences.
Describing this elegant algorithm is beyond the scope of this paper; we refer the
reader to [EMOW11] for its description. We only note that the algorithm itself
applies without changes to our definition of position heap, as it does not depend in
any way on the order in which the suffixes of T are inserted.

However, the algorithm of [EMOW11] runs on PH(T) enriched with some
additional information. Let i denote the node of PH(T) storing position i, 1 ≤ i ≤ n.
The extended data structure, called the augmented position heap, should allow the
following queries to be answered in constant time:

• given a position i, retrieve the node i,

• given two nodes i and j, is i a (not necessarily immediate) ancestor of j?

• given a position i of T, retrieve the node T[i..i + ℓ], where T[i..i + ℓ] is the
longest substring of T starting at position i and represented in PH(T).

To answer the first query, [EMOW11] simply introduces an auxiliary array
storing, for each position i, a pointer to the node i. Maintaining this array during the
construction of PH(T) by Algorithm 1 is trivial: once a position is assigned to a
newly created node (line 11 of Algorithm 1), a new entry of the array is set. If T is
not ended by a unique symbol and then the final PH(T) has secondary positions,
those are easily recovered by traversing the chain of suffix pointers at the very end
of the construction.

The second query can also be easily answered in constant time after a linear-time
preprocessing of PH(T). A solution proposed in [EMOW11] consists in traversing
PH(T) depth-first and storing, for each node, its discovery and finishing times
[CLR99]. Then node i is an ancestor of node j if and only if the discovery time of i
is smaller than the discovery time of j and the finishing time of i is greater than the
finishing time of j.
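The discovery/finishing-time test can be sketched as follows in Python. The sketch is an illustration under our own assumptions: nodes are identified by their path labels, the tree is given as a hypothetical `children` dictionary derived from the labels of the position heap of aababbbaabaab, and the names `dfs_times` and `is_ancestor` are ours.

```python
def dfs_times(children, root):
    """Iterative depth-first traversal recording discovery and finishing times."""
    discovery, finish = {root: 1}, {}
    clock = 1
    stack = [(root, iter(children.get(root, ())))]
    while stack:
        node, remaining = stack[-1]
        child = next(remaining, None)
        clock += 1
        if child is None:
            finish[node] = clock          # all children processed
            stack.pop()
        else:
            discovery[child] = clock      # first visit of this child
            stack.append((child, iter(children.get(child, ()))))
    return discovery, finish


def is_ancestor(u, v, discovery, finish):
    """True iff u is a proper ancestor of v; constant time after preprocessing."""
    return discovery[u] < discovery[v] and finish[u] > finish[v]


# Path labels of the nodes of the position heap of aababbbaabaab (root is "").
labels = ["", "a", "ab", "b", "abb", "bb", "bba", "ba", "aa", "aba", "baa", "aab"]
children = {}
for w in labels:
    if w:
        children.setdefault(w[:-1], []).append(w)   # parent drops the last letter
discovery, finish = dfs_times(children, "")
```

With these times, the node a is reported as an ancestor of aab, whereas ab is not, since aab lies in the subtree of aa.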

A more space-efficient solution would be to use a balanced parenthesis
representation of the tree topology of PH(T), taking 2n bits, and link each node to the
corresponding opening parenthesis. Then the corresponding closing parenthesis can
be retrieved in constant time by the method of [MR01] using o(n) auxiliary bits.
This allows ancestor queries to be answered in constant time.

The third type of queries is answered by an additional mapping called
maximal-reach pointer [EMOW11]: for a position i of T[1..n], define mrp(i) to be the node
T[i..i + ℓ], where T[i..i + ℓ] is the longest prefix of T[i..n] represented in PH(T).
Observe first that if i is a secondary position, then mrp(i) = i. This is because
a secondary position i is stored in node T[i..n], which trivially corresponds to the
longest prefix starting at i. Therefore, as it is done in [EMOW11], mrp can be
represented by pointers from node i to node mrp(i) whenever these nodes are
different. In our case, we have then to keep in mind that a maximal-reach pointer
from a double node applies to the primary position of this node. Figure 5 provides
an illustration.

In [EMOW11], maximal-reach pointers are computed by an extra traversal of
PH(T), using an auxiliary dual heap structure on top of it (see Introduction).
Here we show that maximal-reach pointers can be easily computed using suffix
pointers instead of the dual heap. Thus, we completely get rid of the dual heap for
constructing the augmented position heap, replacing it with suffix pointers.

After PH(T) is constructed, we compute mrp(i) iteratively for i = 1, 2, ..., s − 1,
where s is the active secondary position of T[1..n]. Assume we have computed
mrp(i) for some i and have to compute mrp(i + 1). Assume mrp(i) = T[i..i + ℓ]. It
is easily seen that T[i + 1..i + ℓ] is a prefix of the string represented by mrp(i + 1).
To compute mrp(i + 1), we follow the suffix link f(mrp(i)) to reach T[i + 1..i + ℓ]
and then keep extending the prefix T[i + 1..i + ℓ] as long as it is represented in
PH(T). The resulting pseudo-code is given in Algorithm 2.

Figure 5: Position heap for string aababbbaabaab with suffix pointers and
maximal-reach pointers mrp (double arrows). Only values for which mrp(i) ≠ i are shown,
namely mrp(1) = 11, mrp(8) = 11, mrp(2) = 9, mrp(3) = 7, mrp(7) = 10. Note
that maximal-reach pointers outgoing from double nodes are unambiguous as for
all secondary positions i, we have mrp(i) = i.



Algorithm 2 Linear-time computation of maximal-reach pointers mrp(i)
 1: currentnode ← root
 2: readhead ← 1
 3: for i = 1 to n do
 4:     while readhead ≤ n and currentnode has an outgoing T[readhead]-edge do
 5:         move currentnode to the target node of the outgoing T[readhead]-edge
 6:         readhead ← readhead + 1
 7:     end while
 8:     mrp(i) ← currentnode
 9:     currentnode ← f(currentnode)
10: end for



It is easy to see that Algorithm 2 works in time O(n): the while-loop makes
exactly n iterations overall, as each iteration increments the readhead counter.

The following property of Algorithm 2 is useful to observe: as soon as readhead
gets the value n + 1 (line 6), the node currentnode gets assigned to the active node of
PH(T[1..n]) (line 9); at the subsequent iterations, the algorithm simply traverses
the chain of suffix links and sets the maximal-reach pointer for each secondary
position to be the node storing this position (lines 8-9).
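A possible Python rendering of Algorithm 2 is sketched below. To keep the example self-contained, it repeats a compact version of the on-line construction with suffix pointers; the names `Node`, `build_position_heap` and `maximal_reach`, as well as the `depth` field stored in each node, are our own assumptions made for illustration.

```python
class Node:
    def __init__(self, depth, position=None):
        self.depth = depth         # length of the path label
        self.position = position   # primary position (None for root and bottom)
        self.children = {}         # letter -> child node
        self.suffix = None         # suffix pointer f


def build_position_heap(text):
    """On-line construction with suffix pointers (compact form of Algorithm 1)."""
    bottom, root = Node(-1), Node(0)
    root.suffix = bottom
    for a in set(text):
        bottom.children[a] = root
    node, suffix_start = root, 1
    for letter in text:
        last = None
        while letter not in node.children:
            new = Node(node.depth + 1, suffix_start)
            node.children[letter] = new
            if last is not None:
                last.suffix = new
            last = new
            node, suffix_start = node.suffix, suffix_start + 1
        node = node.children[letter]
        if last is not None:
            last.suffix = node
    return root


def maximal_reach(text, root):
    """Return a list mrp with mrp[i] the maximal-reach node of position i (1-based)."""
    n = len(text)
    mrp = [None] * (n + 1)
    node, readhead = root, 1
    for i in range(1, n + 1):
        # Extend the current prefix as long as it stays represented.
        while readhead <= n and text[readhead - 1] in node.children:
            node = node.children[text[readhead - 1]]
            readhead += 1
        mrp[i] = node
        node = node.suffix         # drop the first letter for position i + 1
    return mrp


text = "aababbbaabaab"
mrp = maximal_reach(text, build_position_heap(text))
mrp_depths = [node.depth for node in mrp[1:]]
```

On aababbbaabaab, the nodes reached for positions 1, 8, 2, 3 and 7 store the primary positions 11, 11, 9, 7 and 10 respectively, matching the values listed in the caption of Figure 5.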

Maximal-reach pointers constitute an additional data structure on top of the tree
structure of the position heap. However, it is interesting to note that this structure
can be represented compactly in O(n) bits so that mrp(i) can be computed in
constant time. Here is how it can be done. In the rest of this section, mrp(i)
stands for the depth of the corresponding node, i.e. the length of the longest prefix
of T[i..n] represented in PH(T).

As observed earlier, mrp(i) ≥ mrp(i − 1) − 1 for all i ∈ [2..n]. Define δ_i =
mrp(i) − mrp(i − 1) + 1 and observe that mrp(1) + Σ_{i=2}^{n} δ_i = mrp(n) + n − 1 = n,
since mrp(n) = 1. Represent the
vector (mrp(1), δ_2, ..., δ_n) as a binary vector B_mrp by representing all values in
unary followed by a 0. For example, vector (1, 2, 0, 3, 0, 1, 0) is then represented as
10110011100100. Note that the length of B_mrp is 2n. To B_mrp, we will be applying
rank and select operations. Recall that for a binary vector B, rank_1(B, i) returns the
number of 1's occurring in B[1..i], and select_1(B, i) returns the position of the i-th
occurrence of 1 in B (counting from left). rank_0 and select_0 are defined similarly.
It is known that the input binary vector of length n can be pre-processed using
o(n) additional memory bits, so that rank and select queries can be answered in
time O(1) [Jac89, CM96]. Observe now that
mrp(i) = rank_1(B_mrp, select_0(B_mrp, i)) − i + 1.
Therefore, mrp(i) can be computed in constant time. We summarize this in the
following Lemma.

Lemma 5 For the position heap PH(T) of any text T[1..n], the maximal-reach
pointers mrp(i), i ∈ [1..n], can be stored in 2n + o(n) bits so that each mrp(i) can
be recovered in time O(1).
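The encoding behind Lemma 5 can be illustrated by the following Python sketch, which treats mrp(i) as the depth of the maximal-reach node. It is only meant to check the arithmetic: rank and select are implemented naively in linear time, not with the o(n)-bit constant-time structures of [Jac89, CM96], and the function names are our own assumptions.

```python
def encode_depths(depths):
    """Encode (mrp(1), delta_2, ..., delta_n) in unary, each value followed by a 0."""
    bits, previous = [], None
    for d in depths:
        delta = d if previous is None else d - previous + 1
        if delta < 0:
            raise ValueError("depths must satisfy mrp(i) >= mrp(i-1) - 1")
        bits.append("1" * delta + "0")
        previous = d
    return "".join(bits)


def select0(bits, i):
    """1-based position of the i-th 0 in bits (naive linear scan)."""
    count = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "0":
            count += 1
            if count == i:
                return pos
    raise ValueError("fewer than i zeros")


def rank1(bits, pos):
    """Number of 1's in bits[1..pos] (naive count)."""
    return bits[:pos].count("1")


def depth_from_bits(bits, i):
    """Recover mrp(i) as rank_1(select_0(i)) - i + 1."""
    return rank1(bits, select0(bits, i)) - i + 1


# Depths whose difference vector is the example (1, 2, 0, 3, 0, 1, 0) of the text.
example_depths = [1, 2, 1, 3, 2, 2, 1]
example_bits = encode_depths(example_depths)

# Depths of the maximal-reach nodes for aababbbaabaab.
heap_depths = [3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 2, 1]
heap_bits = encode_depths(heap_depths)
```

For the example vector, the fourth 0 is at position 10 and is preceded by six 1's, so the recovered value is 6 − 4 + 1 = 3.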



6 Concluding remarks 

We proposed a construction algorithm of a position heap of a string, under a
modified definition of position heap compared to [EMOW11]. In contrast with the
algorithm of [EMOW11] that processes the sequence right-to-left, our algorithm reads
the string left-to-right and has the on-line property. Drawing a parallel to suffix
trees, our algorithm can be compared to Ukkonen's on-line algorithm [Ukk95], while
the algorithm of [EMOW11] can be compared to Weiner's algorithm [Wei73]. The
similarity of our algorithm to Ukkonen's algorithm goes beyond this parallel, as the
executions of both algorithms (e.g. the way of traversing the tree under construction,
or updating the active node) are clearly analogous.

Position heap is a smaller data structure than suffix tree: it contains exactly 
n + 1 nodes whereas the suffix tree has n leaves and then up to 2n nodes. Still, the 
position heap allows for a linear-time string matching. The position heap is a new 
data structure and many questions about it are open. 

The O(n) complexity bounds of both Algorithm 1 (Theorem 2) and Algorithm 2
are stated for a constant-size alphabet, otherwise a correcting factor log |A| should
be introduced, similarly to the suffix tree construction. One may ask if there exists
a linear-time algorithm (not necessarily on-line) which constructs position heaps
within a time independent of the alphabet size, as Farach's algorithm does for suffix
trees [Far97].

An interesting direction to study is whether the position heap can be compacted.
The theory of compact data structures has become a major subfield of string
processing, and has accumulated a number of interesting and powerful techniques [NM07].
In this paper, we showed that some components of the position heap can be
effectively compacted, however the compaction of its main part (the trie) is still to be
studied.

It would be interesting to study combinatorial properties of position heaps. In
combinatorial terminology, a tree with n nodes labeled by distinct integers from
{1, ..., n} and such that the label of any node is smaller than the label of any of its
descendants is called an increasing tree. It is known, for example, that there are n!
ordered binary increasing trees [Sta99] ("ordered" means that left and right children
are distinguished), which implies that there are (n + 1)! "potential position heaps"
over binary alphabet. Obviously, there are only 2^n different position heaps over the
binary alphabet. It would be interesting to establish combinatorial properties that
distinguish arbitrary increasing trees from position heaps.

The authors of [EMOW11] proposed algorithms for updating the position heap
when the input string undergoes modifications (character insertions/deletions). We
believe that these algorithms can be easily applied to our definition of position heap.

Recently, the authors of [NII+12] showed that the position heap can be generalized
to a set of strings stored in a trie such that the construction and pattern matching
remain linear-time. Other interesting applications of position heap are still to be
discovered.

From a more practical perspective, it would be also interesting to exploit the
"adaptiveness" of position heaps to substring frequencies, mentioned in Section 3.